{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/data_connectors/simple_directory_reader.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Simple Directory Reader"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `SimpleDirectoryReader` is the most commonly used data connector that _just works_.  \n",
    "Simply pass in a input directory or a list of files.  \n",
    "It will select the best file reader based on the file extensions.  "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get Started"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.\n",
      "ERROR: could not open HSTS store at '/home/loganm/.wget-hsts'. HSTS will be disabled.\n",
      "--2023-12-21 16:34:53--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 75042 (73K) [text/plain]\n",
      "Saving to: ‘data/paul_graham/paul_graham_essay1.txt’\n",
      "\n",
      "data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.04s   \n",
      "\n",
      "2023-12-21 16:34:53 (1.81 MB/s) - ‘data/paul_graham/paul_graham_essay1.txt’ saved [75042/75042]\n",
      "\n",
      "Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.\n",
      "ERROR: could not open HSTS store at '/home/loganm/.wget-hsts'. HSTS will be disabled.\n",
      "--2023-12-21 16:34:53--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 75042 (73K) [text/plain]\n",
      "Saving to: ‘data/paul_graham/paul_graham_essay2.txt’\n",
      "\n",
      "data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.04s   \n",
      "\n",
      "2023-12-21 16:34:53 (1.78 MB/s) - ‘data/paul_graham/paul_graham_essay2.txt’ saved [75042/75042]\n",
      "\n",
      "Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.\n",
      "ERROR: could not open HSTS store at '/home/loganm/.wget-hsts'. HSTS will be disabled.\n",
      "--2023-12-21 16:34:54--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 75042 (73K) [text/plain]\n",
      "Saving to: ‘data/paul_graham/paul_graham_essay3.txt’\n",
      "\n",
      "data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.04s   \n",
      "\n",
      "2023-12-21 16:34:54 (1.80 MB/s) - ‘data/paul_graham/paul_graham_essay3.txt’ saved [75042/75042]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!mkdir -p 'data/paul_graham/'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay1.txt'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay2.txt'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay3.txt'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import SimpleDirectoryReader"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load specific files "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reader = SimpleDirectoryReader(\n",
    "    input_files=[\"./data/paul_graham/paul_graham_essay1.txt\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded 1 docs\n"
     ]
    }
   ],
   "source": [
    "docs = reader.load_data()\n",
    "print(f\"Loaded {len(docs)} docs\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load all (top-level) files from directory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reader = SimpleDirectoryReader(input_dir=\"./data/paul_graham/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded 3 docs\n"
     ]
    }
   ],
   "source": [
    "docs = reader.load_data()\n",
    "print(f\"Loaded {len(docs)} docs\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load all (recursive) files from directory "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p 'data/paul_graham/nested'\n",
    "!echo \"This is a nested file\" > 'data/paul_graham/nested/nested_file.md'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# only load markdown files\n",
    "required_exts = [\".md\"]\n",
    "\n",
    "reader = SimpleDirectoryReader(\n",
    "    input_dir=\"./data\",\n",
    "    required_exts=required_exts,\n",
    "    recursive=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded 1 docs\n"
     ]
    }
   ],
   "source": [
    "docs = reader.load_data()\n",
    "print(f\"Loaded {len(docs)} docs\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create an iterator to load files and process them as they load"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4\n"
     ]
    }
   ],
   "source": [
    "reader = SimpleDirectoryReader(\n",
    "    input_dir=\"./data\",\n",
    "    recursive=True,\n",
    ")\n",
    "\n",
    "all_docs = []\n",
    "for docs in reader.iter_data():\n",
    "    for doc in docs:\n",
    "        # do something with the doc\n",
    "        doc.text = doc.text.upper()\n",
    "        all_docs.append(doc)\n",
    "\n",
    "print(len(all_docs))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Full Configuration"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the full list of arguments that can be passed to the `SimpleDirectoryReader`:\n",
    "\n",
    "```python\n",
    "class SimpleDirectoryReader(BaseReader):\n",
    "    \"\"\"Simple directory reader.\n",
    "\n",
    "    Load files from file directory. \n",
    "    Automatically select the best file reader given file extensions.\n",
    "\n",
    "\n",
    "    Args:\n",
    "        input_dir (str): Path to the directory.\n",
    "        input_files (List): List of file paths to read\n",
    "            (Optional; overrides input_dir, exclude)\n",
    "        exclude (List): glob of python file paths to exclude (Optional)\n",
    "        exclude_hidden (bool): Whether to exclude hidden files (dotfiles).\n",
    "        encoding (str): Encoding of the files.\n",
    "            Default is utf-8.\n",
    "        errors (str): how encoding and decoding errors are to be handled,\n",
    "                see https://docs.python.org/3/library/functions.html#open\n",
    "        recursive (bool): Whether to recursively search in subdirectories.\n",
    "            False by default.\n",
    "        filename_as_id (bool): Whether to use the filename as the document id.\n",
    "            False by default.\n",
    "        required_exts (Optional[List[str]]): List of required extensions.\n",
    "            Default is None.\n",
    "        file_extractor (Optional[Dict[str, BaseReader]]): A mapping of file\n",
    "            extension to a BaseReader class that specifies how to convert that file\n",
    "            to text. If not specified, use default from DEFAULT_FILE_READER_CLS.\n",
    "        num_files_limit (Optional[int]): Maximum number of files to read.\n",
    "            Default is None.\n",
    "        file_metadata (Optional[Callable[str, Dict]]): A function that takes\n",
    "            in a filename and returns a Dict of metadata for the Document.\n",
    "            Default is None.\n",
    "\"\"\"\n",
    "```\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
